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We explore an approach due to Nievergelt of decomposing a time-evolution equation along 
the time dimension and solving it in parallel with as little communication as possible between 
the processors. This method computes a map from initial conditions to final conditions locally 
on slices of the time domain, and then patches these operators together into a global solution 
using a single communication step. A basic error analysis is given, and some comparisons are 
made with other parallel in time methods. Based on the assumption that parallel computation 
is cheap but communication is very expensive, it is shown that this method can be competitive 
for some problems. We present numerical simulations on graphics chips and on traditional 
parallel clusters using hundreds of processors for a variety of problems to show the practicality 
and scalability of the proposed method. 
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1. Introduction 



Consider two different computations. 



Use forward Euler to advance a single scalar ODE by a billion tiniesteps, (1) 



and 



% 



Use forward Euler to advance a billion uncoupled ODEs by one timestep. (2) 

Implemented in the standard, straightforward way, the problems ([T]) and ([2|) require 
about the same number of floating point operations. However, each operation in 
([1]) depends on the one before it, while the operations of ([2]) can all be performed 
independently. In particular, ([1]) relies on the clock speed of a single processor, while 
([2]) depends on the number of processors. Hardware is moving in the direction of 
increased concurrency and parallelism, and computational science is moving with 
it, so that processing units are proliferating and becoming very cheap, while clock 
speed will remain expensive. 

The catch in the description of ([2]) is the word "uncoupled." For useful problems 
in scientific computing, problems are rarely completely uncoupled, which means 
that different parallel processes will have to communicate somehow. That leads us 
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to consider a third class of problem, namely 

Add a billion numbers, each in distinct (and distant) physical locations. (3) 

In addition to the floating point operations, problems of type ([3]) require a great 
deal of communication, which is already expensive on current hardware. On future 
hardware we anticipate that this cost will be the dominant one, and that good 
algorithms will avoid problems of this type as much as possible. 

In this paper we take these trends in hardware seriously, perhaps even taking 
them to an extreme. In particular, we assume that computations of type ([2]) are 
not just very cheap but essentially free. We assume that problems of type ([I]) are 
more expensive. Most importantly, we assume that problems of type ([3]), namely 
communication, are very expensive, so expensive that we will do a thousand or a 
million problems of type ([2]) in order to avoid doing one problem of type ([3]). 

In 1964 Nievergelt presented an approach to the solution of time-dependent dif- 
ferential equations that divides the time domain into slices which are assigned to 
different processors [201]. His approach never attracted much attention because of 
its high computational cost in terms of floating point operations, but we revisit it 
here with an eye toward communication cost instead, a factor which was not pre- 
dicted to be important when Nievergelt was writing. We present three numerical 
examples for this method (which we believe are the first actual parallel implemen- 
tations of the method), and we argue that the Nievergelt approach minimizes total 
communication among all parallel in time algorithms. 

The basic idea of the Nievergelt approach is to construct a propagation operator 
on each slice, that is, a map from initial conditions to final conditions, in a way 
that is embarrassingly parallel but requires a great deal of computation. These 
maps are composed with a single reduce operation of type ^ at the very end of 
the computation. As presented here this method has only limited applicability and 
in particular is difficult to apply to large systems of nonlinear equations. However, 
because of its simplicity and its lack of communication, we present it as an instruc- 
tive example of what kinds of algorithms will minimize communication in future 
computing environments where parallel computations will become much cheaper 
in comparison to communication. 

2. Some context 

Most parallel algorithms for evolution equations exploit parallelism in space, often 
with very impressive speedup and scalability. Here we focus on parallelism in time, 
and in particular we will focus on applications where the computational resources 
are abundant and total time to solution is the primary concern. 

One way to exploit parallel computation in the time integration is to use a 
predictor-corrector framework where the correctors run in parallel one or more 
steps behind the predictors [2] . The primary purpose of parallelism in this case is 
to improve the accuracy of the time integration rather than to speed it up, and 
this method is not suitable for use on a large number of processors. 

Some methods based on the matrix exponential are well-suited to parallelism 
and require minimal communication between processors. One early effort in this 
direction is from Gallopoulos and Saad [7|], who describe a parallel Krylov sub- 
space method for approximating the matrix exponential. A more recent approach 
is the paraexp method of Gander and Giittel for linear initial value problems with 
constant coefficients [lj|, which uses a time-domain decomposition together with 
a "near-optimal exponential propagator" to parallelize the solve, with results that 
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show good speedup on up to eight processors. A similar method is due to Hochbruck 
and Ostermann [15l |. where again a Unear initial value problem can be split into 
several independent problems, where the division is based on quadrature of the 
variation-of-constants formula. These approaches depend on the matrix exponen- 
tial, so that they are not suitable for problems with a coefficient that varies in 
time, while we will see that the Nievergelt approach is suitable for such a problem. 
Similarly, the contour integral quadrature approach of Sheen et. al. [2]| relies on 
the Laplace transform and so cannot be used with varying coefficients, and in addi- 
tion this method is restricted to parabolic problems. The similar Dunford-Cauchy 
integral approach to approximating the matrix exponential in [13] has similar re- 
strictions. We also mention that none of the above papers include numerical results 
on more than eight parallel processors. 

The parareal scheme has been applied on a large number of processors, is suit- 
able for non-constant coefficient problems, and has been used very effectively to 
get speedup for a variety of time-dependent problems [12|Jl6l. Il8l|. In this approach 



the time domain is split into subintervals which are each assigned to processors. An 
iteration involving a fine time-stepping scheme (which determines the desired ac- 
curacy) and a coarse scheme (which facilitates fast transfer of information between 
subdomains) is repeated until the error is small. The parareal methodology has 
been applied to many types of problems with great success, including for problems 
in structural and fluid dynamics in [J] and [6|, for an ocean model in [l7[, and for 
reaction-diffusion problems in [31], among many other practical and more theoret- 
ical studies. The computational cost of the parareal method can be impressively 
reduced by combining it with the Spectral Deferred Correction method, but this 
does nothing to reduce the cost of communication [l9| . 

The approach of Nievergelt that we discuss here is similar to the parareal method 
in terms of the target applications, namely small systems of differential equations 
integrated over a very long time interval where spatial parallelization is not easy, 
but computational resources are abundant. However, the Nievergelt approach re- 
quires the same amount of communication as a single iteration of the parareal 
method, measured in terms of total amount of data communicated (or in terms of 
number of sends and receives). Although minimal communication is our main goal, 
we also note that this method will not require a coarse propagator, something that 
is not always easy to construct. Finally, the parareal algorithm as well as many 
of the matrix exponential or Laplace transform based approaches work relatively 
poorly or not at all for hyperbolic problems [J, la, llO|, while we will see that the 
Nievergelt approach behaves well for the wave equation. 

The price we pay for all of these good properties is that the computational 
cost of Nievergelt is very high when measured simply in terms of floating point 
operations. However, almost all of the computations can be done in parallel, so 
given our (admittedly extreme) assumptions on the computational environment, 
the Nievergelt approach is competitive for some problems. 

In the next section we approach questions of accuracy and scalability in the 
context of a very simple numerical model problem in order to clarify the properties 
of the Nievergelt method. Then in section [J] we present some results on graphics 
processing units, for which the Nievergelt algorithm is well suited, and discuss the 
costs and potential parallel speedup of the method in more depth. In Section O 
we take a slightly more theoretical look at the method and provide some error 
analysis. We apply the method to some larger and slightly more realistic problems 
on standard parallel hardware in Section [6l and we end with some concluding 
remarks. 
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Tabic 1. Errors for the Nicvcrgclt algorithm on four time domains, 
with different discretization sizes in time At and varying number M 
of Chebyshev interpolation in the initial value space, using backward 
Euler. 





M 










At 


3 


4 


5 


6 


7 


0.01 


0.0681 


0.0201 


0.0274 


0.0278 


0.0278 


0.005 


0.0511 


0.00751 


0.0138 


0.0141 


0.0141 


0.0025 


0.0422 


0.00098 


0.00672 


0.00700 


0.00698 


0.001 


0.0370 


0.00293 


0.00254 


0.00280 


0.00278 


0.0001 


0.0339 


0.00526 


0.000050 


0.000296 


0.000278 



3. A simple example 

To make the discussion concrete, consider the model nonhnear initial value problem 

y' = y\ y(0) = 1, (4) 

which has exact solution 



y 



l-t 



We will avoid the singularity at time t = 1 and consider this problem on the time 
interval T = [0,r] with final time T = 1/2. We assume the quantity of interest is 
simply y{T) and we calculate all our (absolute) errors with respect to that. 

Now we subdivide T into subparts Tj = \Tj-i,Tj\,j = 1 . . . A^ where Tq = 
0, T/v = T. On each part Tj we consider an initial value problem 



y =y 



yiTj- 



X.- 



■i-i 



(5) 



on Tj- Of course in general Xj-i is unknown, so we conceptually solve the problem 
for all possible Aj_i, constructing not just a scalar value y(T„) but a mapping (j>> 
that takes initial values to final values. 

We assume Aj_i is known to lie in some range H = [a,b]. We call S the initial 
value space and sample points inside H to get a discrete set H/j. Then, for each 
^ € H/i we run the initial value problem ([5]). When the initial value Aj_i becomes 
known (communicated to us from the previous processor in the final step of the 
algorithm), we use interpolation to calculate Xj. 

To understand how the interpolation process affects accuracy, we represent the 
initial value space H = [0, 2] as a Lagrange interpolant on the M Chebyshev nodes 
in this interval. We present some accuracy results in Table [H Intuitively we wish 
to balance the time-discretization errors with interpolation errors that are of the 
same order to get an efficient method. Since Chebyshev interpolation is spectrally 
accurate, we see in Tabled that we need just 5 or 6 Chebyshev points is enough to 
get 10~ accuracy. Also see Figure [H which shows the component of error related 
to interpolation as the time integration proceeds. We remark that the number of 
subdomains N does not significantly affect these results. 

For comparison we also solve this problem using a parareal method [la]. The 
parareal algorithm proceeds by iteratively applying the correction equation 



xfc+l 






(Ar')+5L(A,')-Gi^(A,'), 



(6) 



where Ay is the initial value for time slice (or processor) j in iteration k, g^t is a 
standard time stepping method with a short timestep, and Gat is a (cheap) coarse 
time stepping method with larger timestep At. To understand the two compared 
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Figure 1. The error of the Nievergelt approach with At = lO"** on a log scale on the vertical axis plotted 
against time on the horizontal axis. The solid line is the error with plain backward Euler timestepping 
and no parallel framework, while the dotted line shows the error using M = 6 Chebyshev points. The left 
figure has 3 parallel time intervals, and the right has 4 parallel time intervals. The interpolation does add 
to the error, but the amount is comparable to the original time discretization error. 



Table 2. Errors for the Nievergelt approach and the 
parareal approach, with At — 0.0001 and various num- 
bers of processors N. For Nievergelt we use M — 6 
Chebyshev nodes for interpolation. Both approaches use 
backward Euler for the fine time discretization, and the 
parareal results also use backward Euler with AT — 0.1 
for the coarse propagator. 







parareal 


parareal 


parareal 


N 


Nievergelt 


fc = 2 


A: = 3 


fc = 5 


1 


2.77e-4 


2.77e-4 


2.77e-4 


2.77e-4 


2 


8.50e-4 


2.77e-4 


2.77e-4 


2.77e-4 


4 


2.83e-4 


9.46e-3 


7.41e-5 


2.77e-4 


8 


2.79e-4 


1.27e-2 


2.50e-4 


2.77e-4 


16 


2.72e-4 


3.18e-3 


2.13e-4 


2.77e-4 


32 


2.54e-4 


9.49e-4 


2.68e-4 


2.76e-4 


64 


2.16e-4 


4.36e-4 


2.73e-4 


2.74e-4 



algorithms visually, see Figures [2] and [3l where we present the scheduling of the 
Nievergelt and parareal algorithms as in the figures in 19|]. Here we have made 
an analogy between the fine propagator of parareal and the construction of the 
map from initial to final conditions in the Nievergelt approach, but we should 
note that this latter operation is much more expensive (by a constant factor of 
M). However, like the fine propagation of parareal, the construction of this map 
is embarrassingly parallel, requiring no synchronization. Also, we have compared 
the coarse propagator of parareal to the interpolation in Nievergelt. These are 
quite different operations, and it is not obvious exactly how their costs compare, 
but they play similar roles in the algorithm. It is clear from these figures that 
Nievergelt involves significantly fewer communications, especially as the number of 
parareal iterations increases. 

The results of the comparison are given in Table [2j Here we use backward Euler 
as both the fine and coarse propagator, with fine discretization size At and coarse 
discretization size AT, and k represents the number of parareal iterations used. 
More iterations gives better accuracy, but recall that each iteration of parareal 
requires as much communication as the entire Nievergelt algorithm, and in Table 
[2] we see that even for this simple problem it usually takes a few iterations to get 
to discretization error. 

To end this section we summarize the Nievergelt approach in pseudocode in 
Figure HI 



April 25, 2013 2:16 International Journal of Computer Mathematics minimal 



A. T. Barker 




time 



1 2 3 

processor 



Figure 2. Timing and communication in the Nievergelt approach. In this figure the gray squares represent 
construction of the propagation operator (usually done by timestepping across Tj with several different 
initial conditions), the black squares represent computing the interpolation of this operator at an initial 
condition communicated from the previous processor, and white circles are communication steps. Compare 
with Figure [S] 



time 




2 3 

processor 



Figure 3. Timing and communication in the parareal method. In this figure the gray squares represent 
action of the fine level timestepping scheme, the black squares represent the action of the coarse timestep- 
ping scheme, and white circles are communication steps. This figure shows fc = 3 iterations of parareal. 
Compare with Figure [2] 



Tabic 3. Time to solution in microseconds and speedup of 
the GPU implementation of the Nievergelt method as com- 
pared to a simple serial implementation on the CPU. 

At N M I Ttotal Tcpu/ Ttotal | Tcstimato 



2-14 


32 


4 


193.6 


2.2 


200.4 


2-14 


64 


4 


187.5 


2.3 


202.5 


2-14 


128 


4 


208.4 


2.0 


237.3 


2-15 


32 


5 


253.7 


3.3 


261.2 


2-15 


64 


5 


228.5 


3.8 


233.0 


2-15 


128 


5 


257.8 


3.4 


252.5 


2-16 


32 


7 


388.3 


4.4 


443.7 


2-16 


64 


7 


338.9 


5.0 


324.2 


2-16 


128 


7 


392.0 


4.3 


298.1 



4. A graphics card implementation and cost model 



One possible use case for the Nievergelt method is on general purpose graphical 
processing units, which are highly parallel and energy efficient computing units 
that are widely and cheaply available due to their use in gaming hardware. Ef- 
fectively using this hardware is a challenge because they are most efficient when 
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Given. An initial value problem y' = f{t,y{t)),y{0) = yo, a 
sample S/j of the initial value space, a final time T, and A^ 
processors. 

Initialize. Let Tj = jT/N and assign the interval [Tj^i,Tj] to 
processor j. 
Computation phase, 
for each processor j: 
for each ^k ^ ^h- 

solve y' = f{t,y{t)),y{Tj^i) = ^k 
let X>; = y{Tj)_ 
construct approximating function </>(^) such that 

<i>{ik) = \). 

Communication phase, 
for each processor j: 

receive Aj_i from previous processor 

calculate \j = 4>{Xj-i) 

send Xj to the next processor 



Figure 4. The minimal communication Nievergelt approach in pseudocode. 



Table 4. Approximate measures of parameters in cost 
models l[7j, ([Sj, in microseconds, computed as a least- 
squares fit from the same computational experiments 
used to make Table [31 

cost symbol value 



advance one timestep (GPU) rp 

apply the Nievergelt map r]v 

GPU overhead tk 
advance one timestep (CPU) 



cpu 
F 



0.040 
0.701 
137. 
0.051 



the computational task can be broken down into very many independent threads, 
which is precisely what the Nievergelt algorithm does. In Table [3] we show the 
speedup of using the parallel capabilities of the GPU over the same computation 
performed in serial on the CPU, with times recorded in microseconds, for the non- 
linear scalar initial value problem, for various timestep sizes At. We partition the 
time interval into A^ time slices, for each of which we compute the map from initial 
to final conditions independently. For each of the A^ time slices we propagate M 
initial conditions so that we can use MN completely independent threads on the 
GPU. The computation is finished by doing some cheap interpolation on the CPU. 
We are using backward Euler for the timestepping and Chebyshev interpolation 
with M nodes (chosen to insure that the total error in the Nievergelt method is 
comparable to time discretization error) and doing all computations in single preci- 
sion. In each case the error for the Nievergelt implementation is comparable to the 
discretization error due to timestepping. The timing results include all the time re- 
quired to allocate memory, build the initial value arrays, transfer data to and from 
the GPU, and to perform the interpolation that evaluates the map from initial to 
final conditions. The CPU and GPU tests are run on the same machine, which has 
a six core AMD Opteron CPU and an nVidia Tesla C2075 GPU. This experiment 
demonstrates the potential for using the inherent parallelism of the graphics card 
to solve the problem faster than is possible with the CPU of the same machine, 
even though many more floating point operations are required. 

One possible model for the cost (in terms of time to solution) for the Nievergelt 
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method using the GPU implementation is 

Cg = -J^TF + Ntn + TK (7) 

where Tp is the time to advance a single timestep using backward Euler for a single 
initial value problem, Ti\f is a measure of the cost of applying the map that has been 
constructed (the final communication and interpolation step), and tk represents 
some fixed GPU startup costs. Predicting the time to solution for the experiments 
in Table [3] using parameter values from Table H] shows good agreement, as you can 
see in the last column of Table [3l 
The corresponding serial cost is simpler, given by 

Cc = n4P^ (8) 

The speedup Cc/Cq can then be written 

cp 

- --'- (9) 



Cc nTp'^ NuKp 

Cg ~ MriTp/N + Ntn + tr ^ Mn + N'^kn ' 



where we have ignored the lower order term tk because we are interested in the 
extreme case of large n and A^ and we have defined Kp = t^^ /tf and kn = tn/tf. 
In order to balance time discretization with interpolation errors, we will assume 
M ^ n"' for a < 1. In fact in model problem with Chebyshev interpolation we 
could choose M ~ log(n) but we will not assume we have such an exponentially 
convergent scheme in general. Then if for fixed problem size n we choose N'^ ~ 
K^ n""*"^ we get 

Cc ^ n-'/\(^+-y^KF ^ (^F_\ (l_„)/2 

which implies that for q < 1, the larger the problem is, the more speedup we can 
get. 

Similar to the cost model analysis of a parareal-type method in \4, Section 3], we 
see that the traditional parallel efficiency for our method is not as good as usual 
space-based parallelism. However, the goal is to make the best use of an available 
GPU that otherwise would not be doing useful work, and we see that it is possible 
to get speedup in this case. 



5. Error analysis 

Nievergelt proves some error bounds on his approach in [20|, using forward Euler 
and linear interpolation. We want to say something about the error in a slightly 
more abstract setting, with correspondingly abstract assumptions. 

To begin, fix j and consider the interval Tj = [Tj_i,Tj]. Let g^ be the exact 
solution operator that maps an initial condition at t = Tj_i to a final condition at 
t = Tj, and let (7^^ be a standard time-discretization method that approximates 
g^ . We will assume that our problem is well posed in the sense that g^ is Lipschitz 
as a function of the initial conditions, 

\g'{Ci)-g'{a)\<K\^i-U 6,6 GH. (10) 
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We evaluate g-'^^ for many initial conditions in order to construct an interpolating 
function 0-^^ which approximates the map gf^^j using some standard interpolation 

technique. (The map 5;^^ in turn approximates the exact operator g^ , which is 
of course what we really want.) Denote the true solution at t = Tj_i by \*_i, 
but of course we will instead have an approximation to this value communicated 
from a previous processor, denoted by Aj_i. We will also use g and g^t without 
superscripts to denote propagators on the whole time interval T = [Tq,Tn], and 
(pAt will be the Nievergelt method used over the entire time domain (including 
several interpolations) . 

We begin our analysis by considering the error at the end of the fixed time 
domain Tj: 

\^U^j-i)-9'{y,-i)\ < l</'it(A,-i) -5L(A,-i)| + Kt(^j-i)-g'{Xj-i)\ 

+ \g\Xj-i)-g\XU)l (11) 

where the first term is recognized as a standard interpolation error, the second 
term is a standard time discretization error, and the third term depends on how 
accurately the solution was computed on the previous time-slice Tj-i- Beginning 
with this last term and using (jlOp . we have 

|9^-(A,_i) - <7^-(A*_i)| < K\X,., - A*_i| = i^|0i-\A,_i) -5^--i(A*_2)|, 

which contains exactly the same type of expression as (jlip . so that we can use it 
recursively to get an expression for the final error, 

\^AtiXN-i)-9''{X*N-i)\ < NK (^max|c/>i,(A,_i) -5i,(A,-i)| 



+ max|5r^^(Aj_i) -5r^(Aj_.i)| j . 
We will require that our time discretization method satisfies 

\9iM3-1) - 9^{X,-i)\ < Cilo(At^), (12) 

where Ci is independent of the time interval Tj — Tj_i and k is the order of error 
in the underlying time discretization method, 

\gAt{Xo)-9{X*o)\<Di{T)0{At''). 

The standard error bounds for many time discretization methods (including linear 
multistep methods) have Di growing exponentially with the length of the time 
interval that we are integrating over, so that (1121) is not hard to satisfy [l|, Il4l |. 

The analogous condition for the interpolation is easy to state but perhaps harder 
to interpret. It is 

l0it(A,-i) - ffL(A,-i)| < C2^0iAe), (13) 

where i again represents the order of error of the underlying interpolation on the 
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initial value space: 

\cl^At{Xo)-9At{Xo)\<D2{T)0{Ae). 

Here we are assuming that the interpolating function to the time discretization 
operator is more accurate if the time integration is over a short interval than if it is 
over a long interval (cf. [20|]). If we use polynomial interpolation, standard results 
give us 



l0L(A,-i) -5L(A,-i)| < CaAe^+^max 



d'+'giM) 



9^P+i 



where p is the order of the polynomial. The quantity in absolute value can be 
interpreted as the D2{T) above, and in general can be expected to grow very 
quickly in T, so that ()13p is also a very plausible assumption for a wide range of 
problems — in particular it certainly holds for our simple model problem @j with 
linear interpolation. 
We put these simple results together in the following 

Theorem 5.1 Under the assumptions ([TOD, (fT^ and (fTOj) . we have 

WU^j-i) - 5^'(A*-i)l < O(AC^) + 0{/\t^), (14) 

and in particular if we balance interpolation and time discretization errors, the 
Nievergelt approach recovers the error of the underlying time discretization. 

For linear differential equations, the solution is also linear as a function of its 
initial condition, so that the interpolation error can be made zero using only enough 
points to construct a basis for the initial condition space, as shown in [2Q] . In the 
remainder of the paper we will focus on this simpler case. 

We close this section with a few comments on the cost required to achieve this er- 
ror. The Nievergelt algorithm is very computationally costly in conventional terms. 
In a sense, we trade off everything else against communication costs, assuming those 
to be dominant. The result is that floating point operation counts and storage costs 
are very high. For example, when applied to a nonlinear PDF discretized with r 
unknowns, the initial value space will have dimension r. If you sample uniformly 
in this high-dimensional space, with i points for each of the r unknowns, then the 
total number of operations to construct the propagation operator (or storage to 
store the final conditions) will grow as i'^ . If parallel operations are truly free we 
may be willing to pay this cost, or in some cases we may be able to sample much 
more efficiently than uniformly. 

In addition to the computational cost, in many dimensions the choice of how 
to sample the initial value space and how to perform the interpolation step is not 
straightforward. If some suitable interpolation can be applied, than the Nievergelt 
method can in principle be applied and will result in parallel speedup, but in many 
cases this will not be practical and some other algorithm must be sought. 

However, in the remainder of this paper we will consider linear problems, where 
the storage and cost issues can be greatly reduced and the interpolation issue 
disappears completely. In particular, if after space discretization we have the system 

±y = A{t)y + h{t), (15) 
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Table 5. Time to solution in 
seconds for the model prob- 
lem (1171 1 using various num- 
bers of processors A'^. Here 
At ^ 0.005 and Aa; ^ 0.1, wc 
are propagating from i — to 
t — 10 using backward Euler 
for timcstcpping. 



N 


^total 


-^comm 


1 


0.2870 





2 


1.0547 


1.71C-4 


4 


0.5264 


4.13e-4 


8 


0.2637 


4.51e-4 


16 


0.1322 


2.38e-4 


32 


0.06730 


6.16e-4 


64 


0.03453 


6.71e-4 


128 


0.01763 


6.13e-4 



then the solution is an affine function of the initial condition yo- With M points in 



the spatial discretization, then yo = X]i=i Vj^j ^^ the unit basis vectors Cj, and we 
can solve (J15p once with a zero initial condition and solve a homogeneous version 
of (|15p M times, once for each basis vector. This can be done with no knowledge 
of the true initial condition coefficients yj . Then we reconstruct the solution 

M 

y = 9At{0) + ^yjgAt{ej). (16) 



We use the notation (^At (rather than (/)At) because there is no need for interpolation 
in this linear setting and this technique introduces no additional error beyond 
rounding error. 



6. The heat equation and wave equation 

Having examined the Nievergelt approach numerically on a simple scalar initial 
value problem and done some analysis, we now turn to some results for slightly 
more complicated problems. First we consider the variable coefficient heat equation 

||=(l + ^sin(t))g + 6(t), x€[0,l] (17) 

in one dimension with homogeneous Dirichlet boundary conditions and with b{t) 
and the initial conditions chosen so that the true solution is given by 

y = cos(t) sin(7ra;). 

We note that (|17p has a nonconstant coefficient and so that methods on exponential 
quadrature or Laplace transforms 13|, ll5|, l2l|] and the paraexp method U| are not 



applicable. 

After discretization in space using standard centered finite differences and M 
spatial points, (I17p can be written in the form ()15p . On each parallel time slice we 
solve this problem M + 1 times with different initial conditions (and with b{t) 7^ 
only once) and reconstruct the solution using (I16p and a single communication 
step. 

Some numerical results are shown in Tabled where we can see a speedup of about 
15. Note that we are able to get this speedup partially because we are willing to 
accept poor spatial resolution. If more points in space are required, we will need 
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Table 6. Pararcal comparison to Tabic [5] with Ai — 0.005, AT — 0.1, and propagating to a 
final time of 10. 





fc = 2 




fc = 3 




k = A 




fc = 8 




N 


-^ total 


-^comm 


-'total 


-'comm 


-'total 


-'comni 


-'total 


^ conim 


2 


0.3047 


2.79e-5 


0.4533 


4.20e-5 


0.6028 


4.89e-5 


1.204 


1.06e-4 


4 


0.1500 


2.88e-4 


0.2250 


5.48e-4 


0.2994 


4.72e-4 


0.5986 


8.81e-4 


8 


0.07744 


6.22e-4 


0.1152 


7.53e-4 


0.1533 


6.56e-4 


0.3054 


1.03e-3 


16 


0.04020 


1.35e-3 


0.05938 


1.87e-3 


0.07721 


5.66e-4 


0.1543 


7.54e-4 


32 


0.02163 


1.72e-3 


0.03119 


1.57e-3 


0.04093 


1.85e-3 


0.07940 


1.56e-3 



Table 7. Time to solution 
in seconds for the model 
problem ( 117t using various 
numbers of processors A'^. 
Here Ai ^ 0.001 and Ax ^ 
0.05, we are propagating 
from i ^ to i ^ 10 
using backward Euler for 
timestepping. 



Af 


Jtotal 


1 comm 


1 


1.641 





2 


12.309 


4.84e-4 


4 


6.112 


2.26-3 


8 


3.068 


l.lle-2 


16 


1.530 


4.92e-3 


32 


0.7748 


1.12e-2 


64 


0.3859 


1.57e-3 


128 


0.1939 


6.07e-4 


256 


0.0984 


4.37e-4 



Table 8. Time to solution in 
seconds for the model prob- 
lem l |17D using various num- 
bers of processors A'". Here 
Ai ^ 0.0005 and Aa; ^ 0.01, 
propagating from i — to 
t — 10, using backward Eu- 
ler for timestepping. 



N 


^total 


^ comm 


1 


16.166 





32 


45.593 


0.230 


64 


22.901 


0.140 


128 


11.867 


0.203 


256 


6.130 


0.361 


512 


3.118 


0.0963 


1024 


1.595 


0.0780 



Table 9 . Timing comparisons 
method. Here At ^ 0.005 and 

delay of 0.001 seconds for every 



between parareal and the Nievergelt minimal communication 
we use backward Euler and include an artificial communication 
receive. The parareal method uses AT — 0.1. 





Nievergelt 




fc = 2 




k = 4 




fc = 8 




N 


Ttotal 


./comm 


^total 


-/comm 


Ttotal 


-t comm 


^total 


-t comm 


1 


0.2870 

















2 


1.0690 


4.31e-3 


0.3251 


2.37e-2 


0.6349 


3.25e-2 


1.4513 


7.98e-2 


4 


0.5360 


7.82e-3 


0.1680 


1.76e-2 


0.3568 


5.66e-2 


0.7220 


1.21e-l 


8 


0.2834 


1.89e-2 


0.1105 


3.38e-2 


0.2049 


5.41e-2 


0.4199 


1.16e-l 


16 


0.1500 


1.73e-2 


0.07022 


3.18e-2 


0.1347 


5.85e-2 


0.2512 


9.97e-2 


32 


0.08318 


1.64e-2 


0.05538 


3.55e-2 


0.1067 


6.75e-2 


0.2002 


1.23e-l 


64 


0.04521 


1.15e-2 


0.04460 


3.44e-2 


0.08416 


6.43e-2 


0.1645 


1.25e-l 



many more processors to see speedup, but there is no real barrier to doing this 
provided the processors are available. Scaling results for a parareal implementation 
of this problem are shown in Table [H See also Tables [7] and [8] which achieve similar 
speedups for larger problems with more processors. 

Comparisons of the communication times in Tables [5] and [6] show that we do not, 
in fact, live in a world where parallel floating point operations are free compared 
to communication time, and the Nievergelt algorithm is competitive with parareal 
only if we require an unrealistically large number of parareal iterations. In Table 
[9] we simulate additional latency by enforcing a delay of one millisecond for every 
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Tabic 10. Time to solution 
for model problem {TsJ with 
M — 40 Chebyshcv points, 
integrating to a final time of 
16. 



TV 


^total 


^ comm 


1 


3.193 





2 


64.950 


0.0782 


4 


32.453 


0.00122 


8 


16.334 


0.0907 


16 


8.163 


0.0416 


32 


4.124 


0.00121 


64 


2.061 


0.0299 


64 


2.049 


0.0288 


128 


1.039 


0.0223 


256 


0.552 


0.0620 



message that is received from another processor, and it is clear that in this simu- 
lated high-latency environment the additional communication steps of the parareal 
method are a drawback, especially for problems where a large number of iterations 
might be required. As parallel computations become cheaper relative to communi- 
cation, the Nievergelt approach becomes more and more promising in comparison 
to the parareal method. 

The parareal algorithm is reported to work relatively poorly for second-order 
problems and for hyperbolic problems [4l. IsMlOl] . and many of the other time-parallel 
methods we considered in the introduction are restricted to parabolic problems 
[71, ll5|, l2]|. As our final numerical example we consider the Nievergelt method 
applied to these types of problems. In [8, |9|, a version of the parareal algorithm 
for second-order and hyperbolic algorithms is presented — however, this method 
requires the communication of more initial conditions from more distant processors 
(not just the immediately preceding time slice), and so is not competitive in trying 
to minimize communication. 

We will use an example problem from [22], Program 19, which is the wave equa- 
tion 






X G [0, 1] (18) 



with homogeneous Dirichlet boundary conditions and with the initial conditions 
such that the solution is a single peaked wave propagating to the left. This problem 
is discretized in space with a spectral method on Chebyshev points, and time- 
stepping is with the explicit second-order leapfrog method. We will use a varying 
number M of spatial points but always choose a timestep of At = 8/M^ to satisfy 
a CFL condition. See [22l | for details for this model problem. 

The time-domain decomposition proceeds much as it did for the heat equation, 
the only important difference being that we have two initial conditions (data from 
the previous timestep and the step before that) to deal with and to transfer from 
subdomain to subdomain. This has the effect of doubling the initial value space, 
that is, if we have M = 80 spatial points, we need to propagate 160 initial conditions 
to get the full map from initial conditions to final conditions. 

Numerical results are shown in Tables [10] and [TTl 



7. Conclusion 

We have examined the method of Nievergelt for the parallel solution of time- 
dependent problems, a method which essentially builds the time propagation op- 
erator on each time-slice in parallel with no knowledge of the initial conditions. 
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Table 11. Time to solution 
for model problem (|18| l with 
iVf ^ 80 Chebyshev points, 
integrating to a final time of 
20. 



N 


^total 


-tcomm 


1 


9.058 





16 


90.969 


0.399 


32 


45.673 


0.291 


64 


22.845 


0.150 


128 


11.454 


0.104 


256 


5.809 


0.211 


512 


3.030 


0.212 



It is clear from a viewpoint of counting arithmetic operations that the Nievergelt 
approach is not efficient, and it may never be suitable for nonlinear problems 
with a large number of degrees of freedom. However, we are moving to a future 
where computational cores are essentially free — hundreds of them may be sitting 
unused on your desktop, and if there is any way to use them profitably it may be 
worthwhile even if the efficiency is poor. In addition, for some applications the total 
time to solution is the only concern, and the amount of computational resources 
spent is less important. It is this kind of hardware environment and these kind of 
applications that the parareal algorithm was invented to deal with. The Nievergelt 
method compares well to the parareal algorithm but has the important advantage 
of always requiring less communication, and it is applicable to more situations than 
methods based on exponential quadrature or Laplace transforms. 

In our work above, we have shown speedup of 3 on 32 processors for a scalar 
ODE problem, a speedup of 10 on 1024 processors for a variable coefficient heat 
equation and speedup of 5 on 256 processors for a wave equation. These results are 
somewhat worse than those in published parareal implementations, but roughly 
comparable. The original parareal paper [16] predicted speedups of 8 to 18, and in 
|l7l] the authors get speedup of 5 or 6 on 200 processors. The best speedups in [J] 
and [5[ are from 4 to 6. For the different paraexp method, speedups are reported 
to be from 4 to 6 [11]. In addition to speedup on traditional parallel clusters, we 
have seen that implementing the Nievergelt algorithm on a GPU provides modest 
speedup, with this case being of particular interest because many computational 
scientists will have a GPU sitting idle in their desktop while they wait for a CPU 
to compute their results. 

The assumptions that we are using in this work, namely that parallel computa- 
tion is so cheap as to be nearly free when compared to communication costs, are not 
satisfied on traditional parallel machines. However, to some extent this assumption 
holds on modern graphics hardware, and we have shown that algorithms can be 
designed for a future world where this kind of pattern is more common. Even if 
the very high computational and storage costs of the Nievergelt method make it 
impractical in many cases, we argue that something like it could influence at least 
part of the design of future parallel methods and that understanding this extreme 
case may point the way to exploiting future hardware where communication costs 
dominate computational costs. Aside from the model problem at the beginning, 
we have focused on linear problems, because in this case it is easier to construct 
and represent the map from initial conditions to final conditions. In the general 
nonlinear case, using this map will require approximating it and using some high- 
dimensional interpolation scheme, which may be prohibitively expensive. However, 
there is no additional communication necessary for nonlinear problems. 

We have divided our time domain into slices and assigned each slice to a single 
processor on a parallel machine, but it is worth noting that there is a great deal 
more parallelism inherent in the problem than we have exploited. The map from 
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initial to final conditions is constructed by integrating in time a large number of 
initial conditions given by some discretization of the initial condition space, and in 
principle each of these time integrations could be done on a separate processor or 
processing core in parallel. The lack of need for synchronization in the whole algo- 
rithm (except at the final step) suggests the possibility of a distributed computing 
model for some applications. 

There are several possible directions for future research. One is to combine the 
Nievergelt approach with a parareal method in some way, to get a balance of the 
communication costs versus the greatly added computation of Nievergelt. Another 
is to do more realistic nonlinear problems so as to more carefully consider the 
role of interpolation in the method. Ongoing research is to combine the traditional 
parallel implementation with the GPU implementation to see the potential of the 
method on heterogeneous systems. 
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